On near-uniform URL sampling
Authors
Abstract
We consider the problem of sampling URLs uniformly at random from the Web. A tool for sampling URLs uniformly can be used to estimate various properties of Web pages, such as the fraction of pages in various Internet domains or written in various languages. Moreover, uniform URL sampling can be used to determine the sizes of various search engines relative to the entire Web. In this paper, we consider sampling approaches based on random walks on the Web graph. In particular, we suggest ways of improving random-walk-based sampling to make the samples closer to uniform. We suggest a natural test bed based on random graphs for testing the effectiveness of our procedures. We then use our sampling approach to estimate the distribution of pages over various Internet domains and to estimate the coverage of various search engine indexes. © 2000 Published by Elsevier Science B.V. All rights reserved.
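The abstract does not spell out the authors' procedure, so the following is only an illustrative sketch of the general idea of testing walk-based sampling on a random graph. It assumes a small undirected graph in which every node's degree is known, a much easier setting than the directed Web graph. A simple random walk visits nodes in proportion to their degree, and keeping each visit with probability `min_degree / degree` is one textbook way to push the retained samples toward uniform. All names here (`make_random_graph`, `walk_samples`, `max_relative_error`) are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def make_random_graph(n, extra_edges, rng):
    """Undirected test graph: a ring (keeps it connected) plus random chords."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        w = (v + 1) % n
        adj[v].add(w)
        adj[w].add(v)
    for _ in range(extra_edges):
        a, b = rng.sample(range(n), 2)  # two distinct endpoints
        adj[a].add(b)
        adj[b].add(a)
    return {v: sorted(nbrs) for v, nbrs in adj.items()}


def walk_samples(adj, steps, rng, correct_bias):
    """Collect random-walk visits; optionally thin them by min_degree / degree."""
    min_deg = min(len(nbrs) for nbrs in adj.values())
    node = 0
    samples = []
    for _ in range(steps):
        node = rng.choice(adj[node])  # move to a uniformly chosen neighbour
        accept = min_deg / len(adj[node])  # assumption: degrees are known
        if not correct_bias or rng.random() < accept:
            samples.append(node)
    return samples


def max_relative_error(samples, n):
    """Largest deviation of any node's sample share from 1/n, in units of 1/n."""
    counts = Counter(samples)
    total = len(samples)
    return max(abs(counts.get(v, 0) / total - 1 / n) * n for v in range(n))


rng = random.Random(0)
graph = make_random_graph(30, 60, rng)
raw = walk_samples(graph, 200_000, rng, correct_bias=False)
thinned = walk_samples(graph, 200_000, rng, correct_bias=True)
print("raw walk error:    ", max_relative_error(raw, 30))
print("thinned walk error:", max_relative_error(thinned, 30))
```

On a graph like this, the thinned samples should sit noticeably closer to the uniform share than the raw visits. On the real Web, in-links are not directly observable, so this degree-based correction does not carry over as written.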
Similar resources
Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results
The study of near-duplicate content identification involves identifying search categories that generate the same URL in a query result. These categories need to be identified so that results can be improved by removing duplicate URLs. Returning the same URL repeatedly irritates the user and also lowers the priority of other URLs. These URLs are displayed on the second or third page, which users do not ...
Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
We propose a new technique for sampling the solutions of combinatorial problems in a near-uniform manner. We focus on problems specified as a Boolean formula, i.e., on SAT instances. Sampling for SAT problems has been shown to have interesting connections with probabilistic reasoning, making practical sampling algorithms for SAT highly desirable. The best current approaches are based on Markov ...
Sample identification in hip-hop music
Sampling is a creative tool in composition that has been widespread in popular music production and composition since the 1980s. However, the concept of sampling has for a long time been unaddressed in Music Information Retrieval. We argue that information on the origin of samples has a great musicological value and can be used to organise and disclose large music collections. In this paper we intro...
From Sampling to Model Counting
We introduce a new technique for counting models of Boolean satisfiability problems. Our approach incorporates information obtained from sampling the solution space. Unlike previous approaches, our method does not require uniform or near-uniform samples. It instead converts local search sampling without any guarantees into very good bounds on the model count with guarantees. We give a formal an...
INTERNET - DRAFT UC Irvine
A Uniform Resource Locator (URL) is a compact representation of the location and access method for a resource available via the Internet. When embedded within a base document, a URL in its absolute form may contain a great deal of information which is already known from the context of that base document’s retrieval, including the scheme, network location, and parts of the url-path. In situation...
Journal title:
- Computer Networks
Volume 33, Issue
Pages -
Publication date 2000